I am pleased to present this case study on Frito Lay attrition, which analyzes workforce data from Frito Lay to derive actionable insights. The purpose of this study is to understand the factors influencing employee attrition, develop predictive models for attrition risk, and evaluate the performance of those models.
library(dplyr)
library(ggplot2)
library(caret)
library(aws.s3)
library(RCurl)
library(readr)
library(base)
library(tidyverse)
library(naniar)
library(class)
library(GGally)
library(e1071)
library(car)
library(fastDummies)
Here, I load the data from the AWS S3 bucket. I did some slight cleanup of the data: I excluded the "Over18" column from the original data set and the attrition test data set, and I renamed the ID column in the No Salary data set, which did not match the other data sets. I also checked for missing values. There are no missing values in any of the data sets, so there is no need for imputation or deletion of rows.
# AWS credentials redacted; supply your own keys to reproduce
Sys.setenv("AWS_ACCESS_KEY_ID" = "YOUR_ACCESS_KEY_ID",
           "AWS_SECRET_ACCESS_KEY" = "YOUR_SECRET_ACCESS_KEY",
           "AWS_DEFAULT_REGION" = "us-east-2")
# Load the Attrition data set from S3
s3_path <- "s3://msds.ds.6306.2/CaseStudy2-data.csv"
# Read the Attrition Data CSV file from S3
Attritiondata <- aws.s3::s3read_using(read_csv, object = s3_path)
head(Attritiondata,5)
## # A tibble: 5 × 36
## ID Age Attrition BusinessTravel DailyRate Department DistanceFromHome
## <dbl> <dbl> <chr> <chr> <dbl> <chr> <dbl>
## 1 1 32 No Travel_Rarely 117 Sales 13
## 2 2 40 No Travel_Rarely 1308 Research &… 14
## 3 3 35 No Travel_Frequently 200 Research &… 18
## 4 4 32 No Travel_Rarely 801 Sales 1
## 5 5 24 No Travel_Frequently 567 Research &… 2
## # ℹ 29 more variables: Education <dbl>, EducationField <chr>,
## # EmployeeCount <dbl>, EmployeeNumber <dbl>, EnvironmentSatisfaction <dbl>,
## # Gender <chr>, HourlyRate <dbl>, JobInvolvement <dbl>, JobLevel <dbl>,
## # JobRole <chr>, JobSatisfaction <dbl>, MaritalStatus <chr>,
## # MonthlyIncome <dbl>, MonthlyRate <dbl>, NumCompaniesWorked <dbl>,
## # Over18 <chr>, OverTime <chr>, PercentSalaryHike <dbl>,
## # PerformanceRating <dbl>, RelationshipSatisfaction <dbl>, …
#summary(Attritiondata)
vis_miss(Attritiondata) #checking for no missing values
#Read CSV "testing" files from S3
# Reading in of NoSalary Dataset from S3 Bucket
NoSalary <- read.table(textConnection(getURL(
  "https://msdsds6306.s3.us-east-2.amazonaws.com/CaseStudy2CompSet+No+Salary.csv")),
  sep = ",", header = TRUE)
head(NoSalary,5)
## ï..ID Age Attrition BusinessTravel DailyRate Department
## 1 871 43 No Travel_Frequently 1422 Sales
## 2 872 33 No Travel_Rarely 461 Research & Development
## 3 873 55 Yes Travel_Rarely 267 Sales
## 4 874 36 No Non-Travel 1351 Research & Development
## 5 875 27 No Travel_Rarely 1302 Research & Development
## DistanceFromHome Education EducationField EmployeeCount EmployeeNumber
## 1 2 4 Life Sciences 1 1849
## 2 13 1 Life Sciences 1 995
## 3 13 4 Marketing 1 1372
## 4 9 4 Life Sciences 1 1949
## 5 19 3 Other 1 1619
## EnvironmentSatisfaction Gender HourlyRate JobInvolvement JobLevel
## 1 1 Male 92 3 2
## 2 2 Female 53 3 1
## 3 1 Male 85 4 4
## 4 1 Male 66 4 1
## 5 4 Male 67 2 1
## JobRole JobSatisfaction MaritalStatus MonthlyRate
## 1 Sales Executive 4 Married 19246
## 2 Research Scientist 4 Single 17241
## 3 Sales Executive 3 Single 9277
## 4 Laboratory Technician 2 Married 9238
## 5 Laboratory Technician 1 Divorced 16290
## NumCompaniesWorked Over18 OverTime PercentSalaryHike PerformanceRating
## 1 1 Y No 20 4
## 2 3 Y No 18 3
## 3 6 Y Yes 17 3
## 4 1 Y No 22 4
## 5 1 Y No 11 3
## RelationshipSatisfaction StandardHours StockOptionLevel TotalWorkingYears
## 1 3 80 1 7
## 2 1 80 0 5
## 3 3 80 0 24
## 4 2 80 0 5
## 5 1 80 2 7
## TrainingTimesLastYear WorkLifeBalance YearsAtCompany YearsInCurrentRole
## 1 5 3 7 7
## 2 4 3 3 2
## 3 2 2 19 7
## 4 3 3 5 4
## 5 3 3 7 7
## YearsSinceLastPromotion YearsWithCurrManager
## 1 7 7
## 2 0 2
## 3 3 8
## 4 0 2
## 5 0 7
#summary(NoSalary)
vis_miss(NoSalary)
AttritionTest <- read.table(textConnection(getURL(
  "https://msdsds6306.s3.us-east-2.amazonaws.com/CaseStudy2CompSet+No+Attrition.csv")),
  sep = ",", header = TRUE)
head(AttritionTest,5)
## ID Age BusinessTravel DailyRate Department DistanceFromHome
## 1 1171 35 Travel_Rarely 750 Research & Development 28
## 2 1172 33 Travel_Rarely 147 Human Resources 2
## 3 1173 26 Travel_Rarely 1330 Research & Development 21
## 4 1174 55 Travel_Rarely 1311 Research & Development 2
## 5 1175 29 Travel_Rarely 1246 Sales 19
## Education EducationField EmployeeCount EmployeeNumber
## 1 3 Life Sciences 1 1596
## 2 3 Human Resources 1 1207
## 3 3 Medical 1 1107
## 4 3 Life Sciences 1 505
## 5 3 Life Sciences 1 1497
## EnvironmentSatisfaction Gender HourlyRate JobInvolvement JobLevel
## 1 2 Male 46 4 2
## 2 2 Male 99 3 1
## 3 1 Male 37 3 1
## 4 3 Female 97 3 4
## 5 3 Male 77 2 2
## JobRole JobSatisfaction MaritalStatus MonthlyIncome MonthlyRate
## 1 Laboratory Technician 3 Married 3407 25348
## 2 Human Resources 3 Married 3600 8429
## 3 Laboratory Technician 3 Divorced 2377 19373
## 4 Manager 4 Single 16659 23258
## 5 Sales Executive 3 Divorced 8620 23757
## NumCompaniesWorked Over18 OverTime PercentSalaryHike PerformanceRating
## 1 1 Y No 17 3
## 2 1 Y No 13 3
## 3 1 Y No 20 4
## 4 2 Y Yes 13 3
## 5 1 Y No 14 3
## RelationshipSatisfaction StandardHours StockOptionLevel TotalWorkingYears
## 1 4 80 2 10
## 2 4 80 1 5
## 3 3 80 1 1
## 4 3 80 0 30
## 5 3 80 2 10
## TrainingTimesLastYear WorkLifeBalance YearsAtCompany YearsInCurrentRole
## 1 3 2 10 9
## 2 2 3 5 4
## 3 0 2 1 1
## 4 2 3 5 4
## 5 3 3 10 7
## YearsSinceLastPromotion YearsWithCurrManager
## 1 6 8
## 2 1 4
## 3 0 0
## 4 1 2
## 5 0 4
#summary(AttritionTest)
vis_miss(AttritionTest)
Attritiondata <- subset(Attritiondata, select = -c(Over18))
head(Attritiondata,5)
## # A tibble: 5 × 35
## ID Age Attrition BusinessTravel DailyRate Department DistanceFromHome
## <dbl> <dbl> <chr> <chr> <dbl> <chr> <dbl>
## 1 1 32 No Travel_Rarely 117 Sales 13
## 2 2 40 No Travel_Rarely 1308 Research &… 14
## 3 3 35 No Travel_Frequently 200 Research &… 18
## 4 4 32 No Travel_Rarely 801 Sales 1
## 5 5 24 No Travel_Frequently 567 Research &… 2
## # ℹ 28 more variables: Education <dbl>, EducationField <chr>,
## # EmployeeCount <dbl>, EmployeeNumber <dbl>, EnvironmentSatisfaction <dbl>,
## # Gender <chr>, HourlyRate <dbl>, JobInvolvement <dbl>, JobLevel <dbl>,
## # JobRole <chr>, JobSatisfaction <dbl>, MaritalStatus <chr>,
## # MonthlyIncome <dbl>, MonthlyRate <dbl>, NumCompaniesWorked <dbl>,
## # OverTime <chr>, PercentSalaryHike <dbl>, PerformanceRating <dbl>,
## # RelationshipSatisfaction <dbl>, StandardHours <dbl>, …
colnames(NoSalary)[colnames(NoSalary)=="ï..ID"] <- "ID"
colnames(NoSalary)
## [1] "ID" "Age"
## [3] "Attrition" "BusinessTravel"
## [5] "DailyRate" "Department"
## [7] "DistanceFromHome" "Education"
## [9] "EducationField" "EmployeeCount"
## [11] "EmployeeNumber" "EnvironmentSatisfaction"
## [13] "Gender" "HourlyRate"
## [15] "JobInvolvement" "JobLevel"
## [17] "JobRole" "JobSatisfaction"
## [19] "MaritalStatus" "MonthlyRate"
## [21] "NumCompaniesWorked" "Over18"
## [23] "OverTime" "PercentSalaryHike"
## [25] "PerformanceRating" "RelationshipSatisfaction"
## [27] "StandardHours" "StockOptionLevel"
## [29] "TotalWorkingYears" "TrainingTimesLastYear"
## [31] "WorkLifeBalance" "YearsAtCompany"
## [33] "YearsInCurrentRole" "YearsSinceLastPromotion"
## [35] "YearsWithCurrManager"
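As an alternative to renaming the column after the fact, the byte-order mark that produces the `ï..ID` name can be stripped at read time. This is a sketch using base R's `fileEncoding` argument:

```r
# Reading with fileEncoding = "UTF-8-BOM" strips the byte-order mark,
# so the first column is named "ID" instead of "ï..ID"
NoSalary <- read.csv(
  "https://msdsds6306.s3.us-east-2.amazonaws.com/CaseStudy2CompSet+No+Salary.csv",
  fileEncoding = "UTF-8-BOM")
```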
AttritionTest <- subset(AttritionTest, select = -c(Over18))
head(AttritionTest,5)
## ID Age BusinessTravel DailyRate Department DistanceFromHome
## 1 1171 35 Travel_Rarely 750 Research & Development 28
## 2 1172 33 Travel_Rarely 147 Human Resources 2
## 3 1173 26 Travel_Rarely 1330 Research & Development 21
## 4 1174 55 Travel_Rarely 1311 Research & Development 2
## 5 1175 29 Travel_Rarely 1246 Sales 19
## Education EducationField EmployeeCount EmployeeNumber
## 1 3 Life Sciences 1 1596
## 2 3 Human Resources 1 1207
## 3 3 Medical 1 1107
## 4 3 Life Sciences 1 505
## 5 3 Life Sciences 1 1497
## EnvironmentSatisfaction Gender HourlyRate JobInvolvement JobLevel
## 1 2 Male 46 4 2
## 2 2 Male 99 3 1
## 3 1 Male 37 3 1
## 4 3 Female 97 3 4
## 5 3 Male 77 2 2
## JobRole JobSatisfaction MaritalStatus MonthlyIncome MonthlyRate
## 1 Laboratory Technician 3 Married 3407 25348
## 2 Human Resources 3 Married 3600 8429
## 3 Laboratory Technician 3 Divorced 2377 19373
## 4 Manager 4 Single 16659 23258
## 5 Sales Executive 3 Divorced 8620 23757
## NumCompaniesWorked OverTime PercentSalaryHike PerformanceRating
## 1 1 No 17 3
## 2 1 No 13 3
## 3 1 No 20 4
## 4 2 Yes 13 3
## 5 1 No 14 3
## RelationshipSatisfaction StandardHours StockOptionLevel TotalWorkingYears
## 1 4 80 2 10
## 2 4 80 1 5
## 3 3 80 1 1
## 4 3 80 0 30
## 5 3 80 2 10
## TrainingTimesLastYear WorkLifeBalance YearsAtCompany YearsInCurrentRole
## 1 3 2 10 9
## 2 2 3 5 4
## 3 0 2 1 1
## 4 2 3 5 4
## 5 3 3 10 7
## YearsSinceLastPromotion YearsWithCurrManager
## 1 6 8
## 2 1 4
## 3 0 0
## 4 1 2
## 5 0 4
ggplot(data=Attritiondata, aes(x=JobSatisfaction)) +geom_bar(position="dodge") + theme_minimal() + ggtitle("Overall Job Satisfaction")
Here, we look at overall Job Satisfaction among the employees. The majority of employees seem satisfied with their job, with a handful less satisfied. The histogram is left skewed.
ggplot(data=Attritiondata,aes(x=MonthlyIncome, y=Age)) + geom_point(position="jitter") + facet_wrap(~MaritalStatus)+geom_smooth(method="loess") + ggtitle("Monthly Income and Age categorized by Marital Status")
Here, we see Monthly Income by Age, categorized by Marital Status. Divorced and Married employees tend to earn more as their age increases. Single employees show the same positive trend, but they do not earn as much as their Married or Divorced coworkers. Overall, there is definitely a positive trend between age and monthly income.
ggplot(data = Attritiondata, aes(x = MonthlyIncome, y = Age, color = JobInvolvement)) +
geom_point(position = "jitter") +
geom_smooth(method = lm) +
ggtitle("Job Involvement and Monthly Income") + facet_wrap(~JobInvolvement)
Looking at Job Involvement and Monthly Income, it looks like most employees, regardless of Job Involvement, show a positive correlation between Age and Monthly Income. I was trying to see whether those more involved in their job earned a higher income, but that does not seem to be the case at first glance. Many employees appear fairly involved in their jobs (Job Involvement levels 2 and 3).
ggplot(data = Attritiondata, aes(x = MonthlyIncome, y = Age, color = interaction(JobInvolvement))) +
geom_point(position = "jitter") +
geom_smooth(method = lm, se = FALSE) +
ggtitle("Job Involvement and Monthly Income")
When I plotted all four Job Involvement levels with their respective linear regression lines, those with the least job involvement (level 1) start to earn more after crossing 40 years old and a monthly income of 10,000. Level 3 earns the most initially, up to a monthly income of 10,000 and ~40 years old, then earns the least as they near 50. The regression lines for levels 2 and 4 initially fall between the level 1 and level 3 lines, and toward the end they are very close to each other and seem to overlap.
ggplot(data=Attritiondata,aes(x=MonthlyIncome,y=NumCompaniesWorked)) + geom_point(position="jitter") + geom_smooth(method=lm) + ggtitle("Number of Companies Worked and Monthly Income")
There is a positive correlation between Monthly Income and Number of Companies Worked.
ggplot(data = Attritiondata, aes(x = MonthlyIncome)) + geom_histogram() + ggtitle("Monthly Income Histogram")
Looking at a histogram of Monthly Income, it appears right skewed: the mode is less than the median, which is less than the mean.
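The skew can be confirmed numerically: for a right-skewed distribution, the mean exceeds the median (the summary later in this report shows a mean of 6390 versus a median of 4946):

```r
# Right-skew check: the mean exceeds the median for right-skewed data
mean(Attritiondata$MonthlyIncome) > median(Attritiondata$MonthlyIncome)  # TRUE
```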
ggplot(data=Attritiondata,aes(x=MonthlyIncome)) + geom_histogram() + ggtitle("Monthly Income Histogram Categorized by Gender") + facet_wrap(~Gender)
Looking at Monthly Income categorized by Gender, there are more male data points in the data set than female. The mode is higher for men than for women. Both histograms are right skewed.
ggplot(data=Attritiondata,aes(x=MonthlyIncome, y=DistanceFromHome)) + geom_point(position="jitter") + geom_smooth(method=loess) + ggtitle("Monthly Income and Distance from Home")
The relationship between Distance from Home and Monthly Income appears nonlinear. The plot and loess curve imply that distance from home initially increases with monthly income, but once a monthly income of 15,000 is exceeded, distance from home decreases overall. The curve resembles a slight downturned 'w' shape. Looking at this graph, I would interpret distance from home to be in miles; if it were in kilometers, the pattern would make less logical sense.
ggplot(data=Attritiondata,aes(x=MonthlyIncome, y=DistanceFromHome)) + geom_point(position="jitter") + geom_smooth(method=loess) + facet_wrap(~Gender) + ggtitle("Monthly Income and Distance from Home categorized by Gender")
The downturned 'w' shape is more prominent for women than for men, but the overall relationship between Distance from Home and Monthly Income is as described above.
ggplot(data=Attritiondata, aes(x=Department, y=JobSatisfaction, color = Gender)) + geom_point(position ="jitter") + facet_wrap(~Gender) + ggtitle("Department and Job Satisfaction by Gender")
Job Satisfaction seems higher in Research & Development for both males and females. There are fewer data points for Human Resources, so no clearly defined relationship can be concluded there. Job satisfaction also looks decent in Sales for both genders.
ggplot(data=Attritiondata,aes(x=Age, y=TotalWorkingYears, color = MonthlyIncome)) + geom_point(position = "jitter") + ggtitle("Age and Total Working Years with Monthly Income")
This plot confirms that as Age increases, Total Working Years also increases, and Monthly Income follows the same trend. There is a positive correlation among all three variables: Age, Total Working Years, and Monthly Income.
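A correlation matrix gives a quick numeric check of the pairwise trends described above:

```r
# Pairwise correlations among the three plotted variables;
# all off-diagonal entries should be positive if the visual trend holds
cor(Attritiondata[, c("Age", "TotalWorkingYears", "MonthlyIncome")])
```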
I filtered the data to look at specifically those who Attrited, to find some insights and relationships between the variables.
attrition_yes <- dplyr::filter(Attritiondata, Attrition == "Yes")
ggplot(data = attrition_yes, aes(x = Department, fill = Gender))+ geom_bar(position = "dodge") +
ggtitle("Attrition by Department and Gender") + theme_minimal()
Of those who left the company, more men came from Research & Development and Sales, while equal numbers of women left from Research & Development and from Sales. Equal numbers of men and women left from Human Resources.
ggplot(data = attrition_yes, aes(y = JobSatisfaction , x = DistanceFromHome, color = MonthlyIncome))+
geom_point(position = "jitter") + theme_minimal() + geom_smooth(method =lm) + ggtitle("Attrition by Distance from Home, Job Satisfaction with Monthly Income")
Of those who left the company, it seems like many were making under $10,000 monthly. There appears to be a negative relationship between Job Satisfaction and Distance from Home: as distance from home (in miles) increases, job satisfaction goes down.
ggplot(data = attrition_yes, aes(y=JobLevel, x = MonthlyIncome, color = Age))+
geom_point(position="jitter")+geom_smooth(method=lm) + ggtitle("Monthly Income and Job Level with Age")
Of those who left the company, there is a positive relationship between Job Level and Monthly Income. Age seems scattered, but at a job level of around 1 and a monthly income under 5,000, most employees appear to be in their 20s.
ggplot(data = attrition_yes, aes(y=JobLevel, x = Age, color = Gender))+
geom_point(position="jitter")+geom_smooth(method=lm) + ggtitle("Attrition Job Level by Age and Gender")
Of those who left the company, Job Level is positively correlated with Age for both males and females. The slope of the regression line for females appears steeper than for males.
ggplot(data = attrition_yes, aes(x=OverTime, fill = Gender)) + geom_bar() + ggtitle("Attrition - Overtime by Gender")
Of those who left the company, many were working overtime. As mentioned earlier, there are more male data points than female, which is why the chart may look a bit off at first glance. Of the overtime group, roughly 70% were male and 30% were female.
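The approximate 70/30 split can be checked with a proportion table (a sketch; `attrition_yes` is the filtered data frame defined above):

```r
# Gender proportions within the overtime group of employees who left
prop.table(table(attrition_yes$Gender[attrition_yes$OverTime == "Yes"]))
```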
ggplot(data = attrition_yes, aes(x=NumCompaniesWorked, y = PercentSalaryHike)) + geom_point(position = "jitter") + theme_minimal()+geom_smooth(method = lm) + ggtitle ("Attrition - Percent Salary Hike and Number of Companies Worked")
There seems to be a slight downward trend in the relationship between Number of Companies Worked and Percent Salary Hike. I was trying to see whether the number of companies worked affected the percent salary hike positively. At first glance the line appears flat, but on closer inspection there is a slight downward trend.
ggplot(data = attrition_yes, aes(x = PercentSalaryHike, fill = OverTime)) +
geom_histogram( binwidth = 1, color = "black", alpha = 0.7) +
  geom_density(aes(y = after_stat(count)), fill = "transparent", color = "darkblue") +
labs(title = "Histogram with Trend Line of % Salary Hike by Overtime",
x = "Percent Salary Hike", y = "Count") +
theme_minimal() + theme(legend.position = "top") # Adjust legend position
This is a histogram of Percent Salary Hike, with each bar split and shaded by whether the employee(s) worked overtime. Overall, percent salary hike (without the split) appears right skewed, but the shape of the density line suggests the distribution may be multimodal.
ggplot(data = attrition_yes, aes(x=YearsAtCompany, y = PercentSalaryHike, color = PerformanceRating)) + geom_point() + theme_minimal() + ggtitle("Salary Hike v. Years at Company with Performance Rating")
This plot looks at Years at Company versus Percent Salary Hike. Those with a higher performance rating tend to have a higher percent salary hike. For salary hikes under 20%, the performance rating is under 4 and clusters around a rating of 3.
ggplot(data = attrition_yes, aes(y=JobSatisfaction, x = Education)) + geom_point(position="jitter") + theme_minimal() + ggtitle("Job Satisfaction v. Education Categorized by Education Field") + facet_wrap(~EducationField) + geom_smooth(method = lm)
This plot shows the relationship between Job Satisfaction and Education, categorized by education field. Most education fields show a negative correlation between job satisfaction and level of education, except Medical: it is the only field where job satisfaction also seems to increase as education increases.
# Building a Regression Model to Determine Salary
Monthly Income is the “salary” variable
class(Attritiondata$MonthlyIncome)
## [1] "numeric"
sum(is.na(Attritiondata$MonthlyIncome))
## [1] 0
summary(Attritiondata)
## ID Age Attrition BusinessTravel
## Min. : 1.0 Min. :18.00 Length:870 Length:870
## 1st Qu.:218.2 1st Qu.:30.00 Class :character Class :character
## Median :435.5 Median :35.00 Mode :character Mode :character
## Mean :435.5 Mean :36.83
## 3rd Qu.:652.8 3rd Qu.:43.00
## Max. :870.0 Max. :60.00
## DailyRate Department DistanceFromHome Education
## Min. : 103.0 Length:870 Min. : 1.000 Min. :1.000
## 1st Qu.: 472.5 Class :character 1st Qu.: 2.000 1st Qu.:2.000
## Median : 817.5 Mode :character Median : 7.000 Median :3.000
## Mean : 815.2 Mean : 9.339 Mean :2.901
## 3rd Qu.:1165.8 3rd Qu.:14.000 3rd Qu.:4.000
## Max. :1499.0 Max. :29.000 Max. :5.000
## EducationField EmployeeCount EmployeeNumber EnvironmentSatisfaction
## Length:870 Min. :1 Min. : 1.0 Min. :1.000
## Class :character 1st Qu.:1 1st Qu.: 477.2 1st Qu.:2.000
## Mode :character Median :1 Median :1039.0 Median :3.000
## Mean :1 Mean :1029.8 Mean :2.701
## 3rd Qu.:1 3rd Qu.:1561.5 3rd Qu.:4.000
## Max. :1 Max. :2064.0 Max. :4.000
## Gender HourlyRate JobInvolvement JobLevel
## Length:870 Min. : 30.00 Min. :1.000 Min. :1.000
## Class :character 1st Qu.: 48.00 1st Qu.:2.000 1st Qu.:1.000
## Mode :character Median : 66.00 Median :3.000 Median :2.000
## Mean : 65.61 Mean :2.723 Mean :2.039
## 3rd Qu.: 83.00 3rd Qu.:3.000 3rd Qu.:3.000
## Max. :100.00 Max. :4.000 Max. :5.000
## JobRole JobSatisfaction MaritalStatus MonthlyIncome
## Length:870 Min. :1.000 Length:870 Min. : 1081
## Class :character 1st Qu.:2.000 Class :character 1st Qu.: 2840
## Mode :character Median :3.000 Mode :character Median : 4946
## Mean :2.709 Mean : 6390
## 3rd Qu.:4.000 3rd Qu.: 8182
## Max. :4.000 Max. :19999
## MonthlyRate NumCompaniesWorked OverTime PercentSalaryHike
## Min. : 2094 Min. :0.000 Length:870 Min. :11.0
## 1st Qu.: 8092 1st Qu.:1.000 Class :character 1st Qu.:12.0
## Median :14074 Median :2.000 Mode :character Median :14.0
## Mean :14326 Mean :2.728 Mean :15.2
## 3rd Qu.:20456 3rd Qu.:4.000 3rd Qu.:18.0
## Max. :26997 Max. :9.000 Max. :25.0
## PerformanceRating RelationshipSatisfaction StandardHours StockOptionLevel
## Min. :3.000 Min. :1.000 Min. :80 Min. :0.0000
## 1st Qu.:3.000 1st Qu.:2.000 1st Qu.:80 1st Qu.:0.0000
## Median :3.000 Median :3.000 Median :80 Median :1.0000
## Mean :3.152 Mean :2.707 Mean :80 Mean :0.7839
## 3rd Qu.:3.000 3rd Qu.:4.000 3rd Qu.:80 3rd Qu.:1.0000
## Max. :4.000 Max. :4.000 Max. :80 Max. :3.0000
## TotalWorkingYears TrainingTimesLastYear WorkLifeBalance YearsAtCompany
## Min. : 0.00 Min. :0.000 Min. :1.000 Min. : 0.000
## 1st Qu.: 6.00 1st Qu.:2.000 1st Qu.:2.000 1st Qu.: 3.000
## Median :10.00 Median :3.000 Median :3.000 Median : 5.000
## Mean :11.05 Mean :2.832 Mean :2.782 Mean : 6.962
## 3rd Qu.:15.00 3rd Qu.:3.000 3rd Qu.:3.000 3rd Qu.:10.000
## Max. :40.00 Max. :6.000 Max. :4.000 Max. :40.000
## YearsInCurrentRole YearsSinceLastPromotion YearsWithCurrManager
## Min. : 0.000 Min. : 0.000 Min. : 0.00
## 1st Qu.: 2.000 1st Qu.: 0.000 1st Qu.: 2.00
## Median : 3.000 Median : 1.000 Median : 3.00
## Mean : 4.205 Mean : 2.169 Mean : 4.14
## 3rd Qu.: 7.000 3rd Qu.: 3.000 3rd Qu.: 7.00
## Max. :18.000 Max. :15.000 Max. :17.00
# Identify character variables
char_vars <- sapply(Attritiondata, is.character)
# Convert character variables to factors
Attritiondata[, char_vars] <- lapply(Attritiondata[, char_vars], as.factor)
#Check Factor Levels for Categorical variables:
sapply(Attritiondata[,char_vars],levels)
## $Attrition
## [1] "No" "Yes"
##
## $BusinessTravel
## [1] "Non-Travel" "Travel_Frequently" "Travel_Rarely"
##
## $Department
## [1] "Human Resources" "Research & Development" "Sales"
##
## $EducationField
## [1] "Human Resources" "Life Sciences" "Marketing" "Medical"
## [5] "Other" "Technical Degree"
##
## $Gender
## [1] "Female" "Male"
##
## $JobRole
## [1] "Healthcare Representative" "Human Resources"
## [3] "Laboratory Technician" "Manager"
## [5] "Manufacturing Director" "Research Director"
## [7] "Research Scientist" "Sales Executive"
## [9] "Sales Representative"
##
## $MaritalStatus
## [1] "Divorced" "Married" "Single"
##
## $OverTime
## [1] "No" "Yes"
#Over18 had only one factor level ("Y"), so it was already removed from the
#data set earlier with the subset() call above
# Fit the linear regression model with all predictors
Model1_fit <- lm(MonthlyIncome ~ ., data = Attritiondata)
summary(Model1_fit)
##
## Call:
## lm(formula = MonthlyIncome ~ ., data = Attritiondata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3708.8 -674.1 14.7 614.1 4100.0
##
## Coefficients: (2 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.013e+01 7.772e+02 0.077 0.938349
## ID -2.343e-01 1.476e-01 -1.588 0.112713
## Age -1.431e+00 5.649e+00 -0.253 0.800110
## AttritionYes 8.904e+01 1.154e+02 0.771 0.440729
## BusinessTravelTravel_Frequently 1.895e+02 1.420e+02 1.334 0.182441
## BusinessTravelTravel_Rarely 3.720e+02 1.200e+02 3.099 0.002005 **
## DailyRate 1.452e-01 9.129e-02 1.591 0.112062
## DepartmentResearch & Development 1.234e+02 4.768e+02 0.259 0.795866
## DepartmentSales -4.594e+02 4.877e+02 -0.942 0.346580
## DistanceFromHome -6.237e+00 4.578e+00 -1.362 0.173417
## Education -3.743e+01 3.716e+01 -1.007 0.314105
## EducationFieldLife Sciences 1.352e+02 3.692e+02 0.366 0.714248
## EducationFieldMarketing 1.377e+02 3.914e+02 0.352 0.725050
## EducationFieldMedical 3.326e+01 3.699e+02 0.090 0.928376
## EducationFieldOther 9.152e+01 3.946e+02 0.232 0.816664
## EducationFieldTechnical Degree 9.680e+01 3.843e+02 0.252 0.801179
## EmployeeCount NA NA NA NA
## EmployeeNumber 8.681e-02 6.103e-02 1.422 0.155269
## EnvironmentSatisfaction -6.267e+00 3.364e+01 -0.186 0.852252
## GenderMale 1.100e+02 7.442e+01 1.478 0.139715
## HourlyRate -3.591e-01 1.824e+00 -0.197 0.844003
## JobInvolvement 1.677e+01 5.321e+01 0.315 0.752698
## JobLevel 2.783e+03 8.340e+01 33.375 < 2e-16 ***
## JobRoleHuman Resources -2.053e+02 5.157e+02 -0.398 0.690663
## JobRoleLaboratory Technician -5.891e+02 1.714e+02 -3.437 0.000618 ***
## JobRoleManager 4.280e+03 2.830e+02 15.122 < 2e-16 ***
## JobRoleManufacturing Director 1.809e+02 1.696e+02 1.067 0.286497
## JobRoleResearch Director 4.077e+03 2.193e+02 18.592 < 2e-16 ***
## JobRoleResearch Scientist -3.494e+02 1.705e+02 -2.049 0.040790 *
## JobRoleSales Executive 5.263e+02 3.576e+02 1.472 0.141449
## JobRoleSales Representative 8.531e+01 3.918e+02 0.218 0.827703
## JobSatisfaction 3.278e+01 3.344e+01 0.980 0.327278
## MaritalStatusMarried 6.708e+01 1.002e+02 0.669 0.503497
## MaritalStatusSingle 1.128e+01 1.361e+02 0.083 0.933978
## MonthlyRate -9.505e-03 5.143e-03 -1.848 0.064946 .
## NumCompaniesWorked 5.421e+00 1.691e+01 0.321 0.748622
## OverTimeYes -1.394e+01 8.434e+01 -0.165 0.868787
## PercentSalaryHike 2.586e+01 1.581e+01 1.635 0.102351
## PerformanceRating -3.235e+02 1.614e+02 -2.004 0.045368 *
## RelationshipSatisfaction 1.640e+01 3.339e+01 0.491 0.623375
## StandardHours NA NA NA NA
## StockOptionLevel -2.758e+00 5.740e+01 -0.048 0.961695
## TotalWorkingYears 5.080e+01 1.098e+01 4.627 4.3e-06 ***
## TrainingTimesLastYear 2.436e+01 2.912e+01 0.837 0.403111
## WorkLifeBalance -3.472e+01 5.161e+01 -0.673 0.501284
## YearsAtCompany -2.750e+00 1.370e+01 -0.201 0.840925
## YearsInCurrentRole 3.398e+00 1.711e+01 0.199 0.842584
## YearsSinceLastPromotion 3.084e+01 1.532e+01 2.013 0.044405 *
## YearsWithCurrManager -2.691e+01 1.669e+01 -1.613 0.107210
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1055 on 823 degrees of freedom
## Multiple R-squared: 0.9501, Adjusted R-squared: 0.9473
## F-statistic: 340.7 on 46 and 823 DF, p-value: < 2.2e-16
#Predictors with p-value < alpha of .05 significantly affect the salary variable
#Model1_Preds = predict(Model1_fit, newdata = NoSalary) #this is an example of predict function you would want to use
#as.data.frame(Model1_Preds)
#write.csv(Model1_Preds,"Model1PredictionsNoSalaryRenuKarthikeyan.csv")
Looking at the Model 1 summary, the statistically significant predictors are Business Travel, Job Level, Job Role, Performance Rating, Total Working Years, and Years Since Last Promotion. The F-statistic tests the overall significance of the model: an F-statistic of 340.7 with a very small p-value (< 2.2e-16) suggests that at least one predictor variable is significantly related to Monthly Income. Two coefficients are not defined because of singularities (EmployeeCount and StandardHours); both columns are constant across all rows, a form of perfect multicollinearity with the intercept.
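The NA coefficients arise because a constant column is perfectly collinear with the intercept. A minimal toy example (hypothetical data, not the attrition data) reproduces the behavior:

```r
# A constant column is collinear with the intercept, so lm() returns NA
# for its coefficient, just like EmployeeCount and StandardHours in Model 1
df <- data.frame(y = c(2, 4, 6, 9), x = 1:4, const = rep(1, 4))
fit <- lm(y ~ x + const, data = df)
is.na(coef(fit)["const"])  # TRUE
```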
Model2_fit = lm(MonthlyIncome ~ NumCompaniesWorked + Age + Gender + MaritalStatus + JobInvolvement + JobRole + DistanceFromHome + JobLevel + Education, data = Attritiondata)
summary(Model2_fit) # P value overall implies that at least one of my variables' slope != 0.
##
## Call:
## lm(formula = MonthlyIncome ~ NumCompaniesWorked + Age + Gender +
## MaritalStatus + JobInvolvement + JobRole + DistanceFromHome +
## JobLevel + Education, data = Attritiondata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3536.4 -699.4 -39.1 659.8 4178.5
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -600.926 320.383 -1.876 0.06105 .
## NumCompaniesWorked 17.866 15.366 1.163 0.24526
## Age 12.875 4.962 2.595 0.00963 **
## GenderMale 120.277 75.308 1.597 0.11061
## MaritalStatusMarried 106.159 95.406 1.113 0.26615
## MaritalStatusSingle 23.451 103.944 0.226 0.82155
## JobInvolvement 15.469 52.613 0.294 0.76882
## JobRoleHuman Resources -327.849 255.950 -1.281 0.20057
## JobRoleLaboratory Technician -534.026 172.285 -3.100 0.00200 **
## JobRoleManager 3933.510 232.847 16.893 < 2e-16 ***
## JobRoleManufacturing Director 92.825 169.877 0.546 0.58492
## JobRoleResearch Director 3919.755 218.518 17.938 < 2e-16 ***
## JobRoleResearch Scientist -246.097 171.976 -1.431 0.15280
## JobRoleSales Executive -122.894 146.575 -0.838 0.40202
## JobRoleSales Representative -392.881 217.075 -1.810 0.07067 .
## DistanceFromHome -7.908 4.553 -1.737 0.08281 .
## JobLevel 3042.789 69.854 43.559 < 2e-16 ***
## Education -33.713 37.379 -0.902 0.36736
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1078 on 852 degrees of freedom
## Multiple R-squared: 0.9461, Adjusted R-squared: 0.945
## F-statistic: 878.9 on 17 and 852 DF, p-value: < 2.2e-16
NoSalary$MonthlyIncome = predict(Model2_fit,newdata = NoSalary)
#Model2_Preds<- NoSalary %>% select(c("ID","MonthlyIncome"))
#as.data.frame(Model2_Preds)
#write.csv(Model2_Preds,"Model2PredictionsNoSalaryRenuKarthikeyan.csv", row.names = T)
These predictors were chosen based on insights from the EDA; I expected them to be significant. The coefficient for JobLevel is 3042.789, suggesting that, on average, a one-unit increase in JobLevel is associated with an increase of about $3,043 in Monthly Income. Likewise, Age has a coefficient of 12.875, indicating that each additional year of age is associated with about $12.88 of additional monthly income, holding all other variables constant. Males have a coefficient of about 120, indicating that, holding all else constant, males earn about $120 more per month than females. The coefficient for Distance from Home is -7.908, indicating that each additional mile from home is associated with a monthly salary decrease of about $7.91.
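The interpretation of a coefficient as the per-unit change in the prediction can be verified by differencing two predictions. A sketch using R's built-in `cars` data (not the attrition data) illustrates the idea:

```r
# In a linear model, increasing a predictor by one unit changes the
# prediction by exactly that predictor's coefficient
fit <- lm(dist ~ speed, data = cars)
p10 <- predict(fit, newdata = data.frame(speed = 10))
p11 <- predict(fit, newdata = data.frame(speed = 11))
all.equal(unname(p11 - p10), unname(coef(fit)["speed"]))  # TRUE
```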
The statistically significant predictors (p < .10) are Age, certain Job Roles (Laboratory Technician, Manager, Research Director, and Sales Representative), Distance from Home, and Job Level. Overall, Model 2 has a high R-squared value, indicating a good fit to the data, and many predictors are statistically significant, suggesting they contribute to determining Monthly Income.
set.seed(1234)
TrainObs = sample(seq(1,dim(Attritiondata)[1]),round(.8*dim(Attritiondata)[1]),replace = FALSE)
SalaryTrain = Attritiondata[TrainObs,]
SalaryTrain
## # A tibble: 696 × 35
## ID Age Attrition BusinessTravel DailyRate Department DistanceFromHome
## <dbl> <dbl> <fct> <fct> <dbl> <fct> <dbl>
## 1 284 31 No Travel_Rarely 691 Sales 7
## 2 848 39 No Travel_Rarely 1132 Research … 1
## 3 101 27 No Travel_Rarely 1377 Research … 11
## 4 623 34 No Travel_Rarely 182 Research … 1
## 5 645 41 Yes Non-Travel 906 Research … 5
## 6 400 44 No Travel_Frequently 1193 Research … 2
## 7 98 37 No Non-Travel 1040 Research … 2
## 8 103 34 No Non-Travel 1381 Sales 4
## 9 726 25 No Travel_Rarely 685 Research … 1
## 10 602 32 No Non-Travel 1200 Research … 1
## # ℹ 686 more rows
## # ℹ 28 more variables: Education <dbl>, EducationField <fct>,
## # EmployeeCount <dbl>, EmployeeNumber <dbl>, EnvironmentSatisfaction <dbl>,
## # Gender <fct>, HourlyRate <dbl>, JobInvolvement <dbl>, JobLevel <dbl>,
## # JobRole <fct>, JobSatisfaction <dbl>, MaritalStatus <fct>,
## # MonthlyIncome <dbl>, MonthlyRate <dbl>, NumCompaniesWorked <dbl>,
## # OverTime <fct>, PercentSalaryHike <dbl>, PerformanceRating <dbl>, …
SalaryTest = Attritiondata[-TrainObs,]
SalaryTest
## # A tibble: 174 × 35
## ID Age Attrition BusinessTravel DailyRate Department DistanceFromHome
## <dbl> <dbl> <fct> <fct> <dbl> <fct> <dbl>
## 1 5 24 No Travel_Frequently 567 Research … 2
## 2 8 37 No Travel_Rarely 309 Sales 10
## 3 9 34 No Travel_Rarely 1333 Sales 10
## 4 16 31 No Non-Travel 1188 Sales 20
## 5 18 46 No Non-Travel 1144 Research … 7
## 6 20 44 No Travel_Rarely 170 Research … 1
## 7 26 44 No Travel_Rarely 1099 Sales 5
## 8 44 42 No Travel_Rarely 1059 Research … 9
## 9 46 42 No Travel_Frequently 748 Research … 9
## 10 47 32 Yes Travel_Rarely 964 Sales 1
## # ℹ 164 more rows
## # ℹ 28 more variables: Education <dbl>, EducationField <fct>,
## # EmployeeCount <dbl>, EmployeeNumber <dbl>, EnvironmentSatisfaction <dbl>,
## # Gender <fct>, HourlyRate <dbl>, JobInvolvement <dbl>, JobLevel <dbl>,
## # JobRole <fct>, JobSatisfaction <dbl>, MaritalStatus <fct>,
## # MonthlyIncome <dbl>, MonthlyRate <dbl>, NumCompaniesWorked <dbl>,
## # OverTime <fct>, PercentSalaryHike <dbl>, PerformanceRating <dbl>, …
Model1_fit <- lm(MonthlyIncome ~ ., data = Attritiondata)
summary(Model1_fit)
##
## Call:
## lm(formula = MonthlyIncome ~ ., data = Attritiondata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3708.8 -674.1 14.7 614.1 4100.0
##
## Coefficients: (2 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.013e+01 7.772e+02 0.077 0.938349
## ID -2.343e-01 1.476e-01 -1.588 0.112713
## Age -1.431e+00 5.649e+00 -0.253 0.800110
## AttritionYes 8.904e+01 1.154e+02 0.771 0.440729
## BusinessTravelTravel_Frequently 1.895e+02 1.420e+02 1.334 0.182441
## BusinessTravelTravel_Rarely 3.720e+02 1.200e+02 3.099 0.002005 **
## DailyRate 1.452e-01 9.129e-02 1.591 0.112062
## DepartmentResearch & Development 1.234e+02 4.768e+02 0.259 0.795866
## DepartmentSales -4.594e+02 4.877e+02 -0.942 0.346580
## DistanceFromHome -6.237e+00 4.578e+00 -1.362 0.173417
## Education -3.743e+01 3.716e+01 -1.007 0.314105
## EducationFieldLife Sciences 1.352e+02 3.692e+02 0.366 0.714248
## EducationFieldMarketing 1.377e+02 3.914e+02 0.352 0.725050
## EducationFieldMedical 3.326e+01 3.699e+02 0.090 0.928376
## EducationFieldOther 9.152e+01 3.946e+02 0.232 0.816664
## EducationFieldTechnical Degree 9.680e+01 3.843e+02 0.252 0.801179
## EmployeeCount NA NA NA NA
## EmployeeNumber 8.681e-02 6.103e-02 1.422 0.155269
## EnvironmentSatisfaction -6.267e+00 3.364e+01 -0.186 0.852252
## GenderMale 1.100e+02 7.442e+01 1.478 0.139715
## HourlyRate -3.591e-01 1.824e+00 -0.197 0.844003
## JobInvolvement 1.677e+01 5.321e+01 0.315 0.752698
## JobLevel 2.783e+03 8.340e+01 33.375 < 2e-16 ***
## JobRoleHuman Resources -2.053e+02 5.157e+02 -0.398 0.690663
## JobRoleLaboratory Technician -5.891e+02 1.714e+02 -3.437 0.000618 ***
## JobRoleManager 4.280e+03 2.830e+02 15.122 < 2e-16 ***
## JobRoleManufacturing Director 1.809e+02 1.696e+02 1.067 0.286497
## JobRoleResearch Director 4.077e+03 2.193e+02 18.592 < 2e-16 ***
## JobRoleResearch Scientist -3.494e+02 1.705e+02 -2.049 0.040790 *
## JobRoleSales Executive 5.263e+02 3.576e+02 1.472 0.141449
## JobRoleSales Representative 8.531e+01 3.918e+02 0.218 0.827703
## JobSatisfaction 3.278e+01 3.344e+01 0.980 0.327278
## MaritalStatusMarried 6.708e+01 1.002e+02 0.669 0.503497
## MaritalStatusSingle 1.128e+01 1.361e+02 0.083 0.933978
## MonthlyRate -9.505e-03 5.143e-03 -1.848 0.064946 .
## NumCompaniesWorked 5.421e+00 1.691e+01 0.321 0.748622
## OverTimeYes -1.394e+01 8.434e+01 -0.165 0.868787
## PercentSalaryHike 2.586e+01 1.581e+01 1.635 0.102351
## PerformanceRating -3.235e+02 1.614e+02 -2.004 0.045368 *
## RelationshipSatisfaction 1.640e+01 3.339e+01 0.491 0.623375
## StandardHours NA NA NA NA
## StockOptionLevel -2.758e+00 5.740e+01 -0.048 0.961695
## TotalWorkingYears 5.080e+01 1.098e+01 4.627 4.3e-06 ***
## TrainingTimesLastYear 2.436e+01 2.912e+01 0.837 0.403111
## WorkLifeBalance -3.472e+01 5.161e+01 -0.673 0.501284
## YearsAtCompany -2.750e+00 1.370e+01 -0.201 0.840925
## YearsInCurrentRole 3.398e+00 1.711e+01 0.199 0.842584
## YearsSinceLastPromotion 3.084e+01 1.532e+01 2.013 0.044405 *
## YearsWithCurrManager -2.691e+01 1.669e+01 -1.613 0.107210
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1055 on 823 degrees of freedom
## Multiple R-squared: 0.9501, Adjusted R-squared: 0.9473
## F-statistic: 340.7 on 46 and 823 DF, p-value: < 2.2e-16
#Model1_Preds = predict(Model1_fit, newdata = NoSalary)
#as.data.frame(Model1_Preds)
#write.csv(Model1_Preds,"Model1PredictionsNoSalaryRenuKarthikeyan")
#Cross Validation and Mean Square Predictor Error Calculation
numMSPEs = 1000
MSPEHolderModel1 = numeric(numMSPEs)
MSPEHolderModel2 = numeric(numMSPEs)
RMSEHolderModel1 = numeric(numMSPEs)
RMSEHolderModel2 = numeric(numMSPEs)
for (i in 1:numMSPEs)
{
TrainObs = sample(seq(1,dim(Attritiondata)[1]),round(.8*dim(Attritiondata)[1]),replace = FALSE)
SalaryTrain = Attritiondata[TrainObs,]
SalaryTrain
SalaryTest = Attritiondata[-TrainObs,]
SalaryTest
Model1_fit <- lm(MonthlyIncome ~ ., data = SalaryTrain)
Model1_Preds = predict(Model1_fit, newdata = SalaryTest)
#MSPE Model 1
MSPE = mean((SalaryTest$MonthlyIncome - Model1_Preds)^2)
MSPE
MSPEHolderModel1[i] = MSPE
RMSEHolderModel1[i] = sqrt(MSPE)
#Model 2
Model2_fit = lm(MonthlyIncome ~ NumCompaniesWorked + Age + Gender + MaritalStatus + JobInvolvement + JobRole + DistanceFromHome + JobLevel + Education, data = SalaryTrain)
Model2_Preds = predict(Model2_fit,newdata = SalaryTest)
MSPE = mean((SalaryTest$MonthlyIncome - Model2_Preds)^2)
MSPE
MSPEHolderModel2[i] = MSPE
RMSEHolderModel2[i] = sqrt(MSPE)
}
mean(MSPEHolderModel1)
## [1] 1196649
mean(MSPEHolderModel2)
## [1] 1194877
mean(RMSEHolderModel1)
## [1] 1091.936
mean(RMSEHolderModel2)
## [1] 1091.156
Conclusion: Model 2 is the better fit, as it has both a lower mean MSPE (1,194,877 vs. 1,196,649) and a lower mean RMSE (1,091.16 vs. 1,091.94), though the margin is small.
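Because both models are scored on the same random split in each iteration, the per-iteration MSPEs are paired, and a paired comparison can indicate whether Model 2's small edge is systematic rather than noise. A sketch of that idea, using simulated stand-ins for the MSPE vectors collected in the loop above:

```r
# Sketch: paired comparison of per-split MSPEs. MSPEHolderModel1 and
# MSPEHolderModel2 are the vectors filled in the cross-validation loop;
# simulated stand-ins are used here so the snippet runs on its own.
set.seed(1234)
MSPEHolderModel1 <- rnorm(1000, mean = 1196649, sd = 150000)
MSPEHolderModel2 <- MSPEHolderModel1 - rnorm(1000, mean = 1800, sd = 20000)

diffs <- MSPEHolderModel1 - MSPEHolderModel2   # positive = Model 2 better on that split
mean(diffs)
t.test(diffs)$p.value   # paired test: is the mean difference distinguishable from zero?
```

If the paired differences are consistently positive across splits, Model 2's advantage is real even though the pooled means are close.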
When I initially attempted to use all predictors for KNN, the model repeatedly failed with the error "Warning: NAs introduced by coercion ... Error in knn(train_predictors, test_predictors, response_train, prob = TRUE, ...): NA/NaN/Inf in foreign function call (arg 6)", even though there are no missing values in the Attritiondata data set. I later realized that NAs were being introduced when the categorical variables were coerced to numeric inside the KNN chunk. Creating dummy columns for these categorical variables allowed KNN to run without errors. Below is the code used to create the dummy columns with the fastDummies package in R.
Attritiondata$BusinessTravel<- as.factor(Attritiondata$BusinessTravel)
Attritiondata$Department<- as.factor(Attritiondata$Department)
Attritiondata$EducationField<- as.factor(Attritiondata$EducationField)
Attritiondata$Gender<- as.factor(Attritiondata$Gender)
Attritiondata$JobRole<- as.factor(Attritiondata$JobRole)
Attritiondata$MaritalStatus<- as.factor(Attritiondata$MaritalStatus)
Attritiondata$OverTime<- as.factor(Attritiondata$OverTime)
Attritiondata<-dummy_cols(Attritiondata,select_columns=c("BusinessTravel","MaritalStatus","JobRole","Department","EducationField","OverTime","Gender"))
Attritiondata <- Attritiondata %>% select(-c("BusinessTravel","MaritalStatus","JobRole","Department","EducationField","OverTime","Gender"))
#Doing Same for Attrition Test Data Set (AttritionTest)
AttritionTest$BusinessTravel<- as.factor(AttritionTest$BusinessTravel)
AttritionTest$Department<- as.factor(AttritionTest$Department)
AttritionTest$EducationField<- as.factor(AttritionTest$EducationField)
AttritionTest$Gender<- as.factor(AttritionTest$Gender)
AttritionTest$JobRole<- as.factor(AttritionTest$JobRole)
AttritionTest$MaritalStatus<- as.factor(AttritionTest$MaritalStatus)
AttritionTest$OverTime<- as.factor(AttritionTest$OverTime)
AttritionTest<-dummy_cols(AttritionTest,select_columns=c("BusinessTravel","MaritalStatus","JobRole","Department","EducationField","OverTime","Gender"))
AttritionTest <- AttritionTest %>% select(-c("BusinessTravel","MaritalStatus","JobRole","Department","EducationField","OverTime","Gender"))
diff_columns_df1 <- setdiff(names(Attritiondata), names(AttritionTest))
cat("Columns in Attritiondata but not in AttritionTest:", paste(diff_columns_df1, collapse = ", "), "\n")
## Columns in Attritiondata but not in AttritionTest: Attrition
set.seed(1234)
iterations <- 100
numks <- 10
splitPerc <- 0.8
masterAcc <- matrix(nrow = iterations, ncol = numks)
for (j in 1:iterations) {
trainIndices <- sample(1:dim(Attritiondata)[1], round(splitPerc * dim(Attritiondata)[1]))
train <- as.data.frame(Attritiondata[trainIndices, ])
test <- as.data.frame(Attritiondata[-trainIndices, ])
response_variable <- "Attrition"
response_train <- factor(train[[response_variable]])
response_test <- factor(test[[response_variable]])
# Select columns for predictors
selected_columns <- c(1, 2, 4:36) # Adjust this range as needed
#selected_columns <- c("ID","Age","BusinessTravel","DailyRate","Department", "DistanceFromHome", "Education", "EducationField", "EmployeeCount", "EmployeeNumber","EnvironmentSatisfaction", "Gender", "HourlyRate", "JobInvolvement", "JobLevel", "JobRole", "JobSatisfaction", "MaritalStatus", "MonthlyIncome", "MonthlyRate", "NumCompaniesWorked", "OverTime", "PercentSalaryHike", "PerformanceRating", "RelationshipSatisfaction", "StandardHours", "StockOptionLevel", "TotalWorkingYears", "TrainingTimesLastYear", "WorkLifeBalance", "YearsAtCompany", "YearsInCurrentRole", "YearsSinceLastPromotion", "YearsWithCurrManager")
# Extract the selected columns
train_predictors <- train[, selected_columns, drop = FALSE]
test_predictors <- test[, selected_columns, drop = FALSE]
train_predictors <- apply(train_predictors, 2, as.numeric)
test_predictors <- apply(test_predictors, 2, as.numeric)
train_predictors <- scale(train_predictors)
test_predictors <- scale(test_predictors)
# Convert to numeric matrices
train_predictors <- as.matrix(train_predictors)
test_predictors <- as.matrix(test_predictors)
# Remove infinite values
train_predictors[!is.finite(train_predictors)] <- 0
test_predictors[!is.finite(test_predictors)] <- 0
if (sum(response_train == "Yes") > 0 && sum(response_test == "Yes") > 0) {
for (i in 1:numks) {
classifications <- knn(train_predictors, test_predictors, response_train, prob = TRUE, k = i)
table(classifications, response_test)
CM_AllK<- confusionMatrix(table(classifications, response_test), positive = "Yes")
masterAcc[j, i] <- CM_AllK$overall[1]
}
}
}
CM_AllK
## Confusion Matrix and Statistics
##
## response_test
## classifications No Yes
## No 143 26
## Yes 2 3
##
## Accuracy : 0.8391
## 95% CI : (0.7759, 0.8903)
## No Information Rate : 0.8333
## P-Value [Acc > NIR] : 0.4685
##
## Kappa : 0.134
##
## Mcnemar's Test P-Value : 1.383e-05
##
## Sensitivity : 0.10345
## Specificity : 0.98621
## Pos Pred Value : 0.60000
## Neg Pred Value : 0.84615
## Prevalence : 0.16667
## Detection Rate : 0.01724
## Detection Prevalence : 0.02874
## Balanced Accuracy : 0.54483
##
## 'Positive' Class : Yes
##
MeanAcc = colMeans(masterAcc); MeanAcc
## [1] 0.7946552 0.7893103 0.8301724 0.8294828 0.8386782 0.8394828 0.8433333
## [8] 0.8406897 0.8425287 0.8426437
plot(seq(1, numks, 1), MeanAcc, type = "l", ylab = "Mean Accuracy (Positive Class: Yes)")
which.max(MeanAcc)
## [1] 7
max(MeanAcc)
## [1] 0.8433333
From the plot, we see that the best k is k = 7. The overall accuracy is 83.91%, but sensitivity (True Positive Rate) is low at 10.35%. The model is good at predicting the majority class ("No") but struggles with the minority class ("Yes").
Looking specifically at the confusion matrix statistics, the interpretation of each is:

- Sensitivity (True Positive Rate): 0.10345 — the proportion of employees who actually attrited that were correctly predicted.
- Specificity (True Negative Rate): 0.98621 — the proportion of employees who did not leave that were correctly predicted.
- Positive Predictive Value (Precision): 0.60000 — the proportion of predicted positives that are true positives (predicted attrited who actually attrited).
- Negative Predictive Value: 0.84615 — the proportion of predicted negatives that are true negatives (predicted stayers who actually stayed).
- Prevalence: 0.16667 — the proportion of actual positives (attrited employees) in the test set.
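Each of these statistics follows directly from the four cells of the 2x2 table; as a sanity check, the sketch below recomputes them from the counts printed above:

```r
# Sketch: recomputing the confusion-matrix statistics from the printed counts,
# with "Yes" (attrited) as the positive class.
TP <- 3; FN <- 26; FP <- 2; TN <- 143

sensitivity <- TP / (TP + FN)                    # 0.10345
specificity <- TN / (TN + FP)                    # 0.98621
ppv         <- TP / (TP + FP)                    # 0.60000
npv         <- TN / (TN + FN)                    # 0.84615
prevalence  <- (TP + FN) / (TP + FN + FP + TN)   # 0.16667
round(c(sensitivity, specificity, ppv, npv, prevalence), 5)
```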
set.seed(1234)
iterations<- 100
accuracy_table <- numeric(iterations)
trainIndices <- sample(1:dim(Attritiondata)[1], round(splitPerc * dim(Attritiondata)[1]))
accuracy_table <- numeric(iterations)
train <- as.data.frame(Attritiondata[trainIndices, ])
test <- as.data.frame(Attritiondata[-trainIndices, ])
train_features<- train[, -which(names(train) == "Attrition")]
test_features<- test[, -which(names(test) == "Attrition")]
train_target<- train$Attrition
# train_scaled<- scale(train_features) Using this in the knn returns "No missing values allowed"
#test_scaled<- scale(test_features)
for (i in 1:iterations) {
classifications <- knn(train_features, test_features, train_target, prob = TRUE, k = 7)
table(classifications, response_test)
CM_AllK7 <- confusionMatrix(table(as.factor(test$Attrition),classifications), positive ="Yes") # note: the actual labels are the first table argument here, so confusionMatrix treats them as the predictions
accuracy_table[i] <- CM_AllK7$overall[1]
}
#print(accuracy_table)
avg_accuracy<-mean(accuracy_table)
avg_accuracy
## [1] 0.7988506
specificity_table<- numeric(iterations)
sensitivity_table<- numeric(iterations)
for (i in 1:iterations) {
classifications <- knn(train_features, test_features, train_target, prob = TRUE, k = 7)
table(classifications, response_test)
CM_AllK7 <- confusionMatrix(table(as.factor(test$Attrition),classifications), positive ="Yes")
specificity_table[i] <- CM_AllK7$byClass['Specificity']
sensitivity_table[i] <- CM_AllK7$byClass['Sensitivity']
}
CM_AllK7
## Confusion Matrix and Statistics
##
## classifications
## No Yes
## No 137 3
## Yes 32 2
##
## Accuracy : 0.7989
## 95% CI : (0.7315, 0.8557)
## No Information Rate : 0.9713
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.0552
##
## Mcnemar's Test P-Value : 2.214e-06
##
## Sensitivity : 0.40000
## Specificity : 0.81065
## Pos Pred Value : 0.05882
## Neg Pred Value : 0.97857
## Prevalence : 0.02874
## Detection Rate : 0.01149
## Detection Prevalence : 0.19540
## Balanced Accuracy : 0.60533
##
## 'Positive' Class : Yes
##
avg_specificity<-mean(specificity_table)
avg_specificity
## [1] 0.8106509
avg_sensitivity<-mean(sensitivity_table)
avg_sensitivity
## [1] 0.4
###### New Threshold using classifications which used k = 7 from above
#classifications
#attributes(classifications) # Look at possible attributes
#attributes(classifications)$prob # Probability of what was classified for that observation
probs = ifelse(classifications == "Yes",attributes(classifications)$prob, 1- attributes(classifications)$prob)
summary(Attritiondata$Attrition)
## No Yes
## 730 140
140/(730+140) #16.09%
## [1] 0.1609195
NewClass = ifelse(probs > .1609, "Yes", "No")
NewClass <- factor(NewClass, levels = levels(response_test))
table(NewClass,response_test)
## response_test
## NewClass No Yes
## No 115 18
## Yes 30 11
CM_NewThreshold <- confusionMatrix(table(NewClass, response_test), positive = "Yes", mode = "everything")
CM_NewThreshold
## Confusion Matrix and Statistics
##
## response_test
## NewClass No Yes
## No 115 18
## Yes 30 11
##
## Accuracy : 0.7241
## 95% CI : (0.6514, 0.7891)
## No Information Rate : 0.8333
## P-Value [Acc > NIR] : 0.9999
##
## Kappa : 0.1479
##
## Mcnemar's Test P-Value : 0.1124
##
## Sensitivity : 0.37931
## Specificity : 0.79310
## Pos Pred Value : 0.26829
## Neg Pred Value : 0.86466
## Precision : 0.26829
## Recall : 0.37931
## F1 : 0.31429
## Prevalence : 0.16667
## Detection Rate : 0.06322
## Detection Prevalence : 0.23563
## Balanced Accuracy : 0.58621
##
## 'Positive' Class : Yes
##
Similar to before, the loop ran for 100 iterations, but with k fixed at 7. There are two confusion matrices: one without the threshold change and one with it. With the lowered threshold, accuracy dropped from 79.89% to 72.41%, and sensitivity and specificity each decreased slightly (40.00% to 37.93%, and 81.07% to 79.31%, respectively). The positive predictive value rose from 5.88% to 26.83%, while the negative predictive value fell from 97.86% to 86.47%.
Overall, the new threshold trades accuracy and negative predictive value for a substantially better positive predictive value. Since the original k = 7 classification retains higher accuracy with comparable sensitivity, the unadjusted threshold remains the stronger choice here.
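The trade-off comes entirely from where the probability cutoff sits: lowering it from the default 0.5 toward the attrition prevalence flags more observations as "Yes", which captures more true positives at the cost of more false alarms. The sketch below illustrates this on simulated class probabilities (not the study data):

```r
# Sketch: how lowering the "Yes" cutoff trades specificity for sensitivity,
# demonstrated on simulated class probabilities with the same 730/140 class mix.
set.seed(1234)
truth <- factor(rep(c("No", "Yes"), times = c(730, 140)))
probs <- ifelse(truth == "Yes", rbeta(870, 2, 4), rbeta(870, 1, 6))

for (cut in c(0.5, 0.1609)) {
  pred <- factor(ifelse(probs > cut, "Yes", "No"), levels = c("No", "Yes"))
  tab  <- table(pred, truth)
  sens <- tab["Yes", "Yes"] / sum(tab[, "Yes"])
  spec <- tab["No", "No"]   / sum(tab[, "No"])
  cat(sprintf("cutoff %.4f: sensitivity %.3f, specificity %.3f\n", cut, sens, spec))
}
```

Lowering the cutoff can only add "Yes" predictions, so sensitivity never decreases while specificity never increases; the choice of cutoff is a business decision about which error is more costly.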
#Applying the best model (KNN with all predictors) to the competition test set to predict attrition
train_features<- Attritiondata[, -which(names(Attritiondata) == "Attrition")]
test_features<- AttritionTest # AttritionTest has no Attrition column to drop
#head(test_features,5)
train_target <- Attritiondata$Attrition
AttritionClassifications <- knn(train_features, test_features, train_target, prob = TRUE, k = 7)
AttritionTest$Attrition <- AttritionClassifications
#head(AttritionTest,5)
AttritionPredictionsKNN<-AttritionTest%>%select(c("ID","Attrition"))
#head(AttritionPredictionsKNN,5)
write.csv(AttritionPredictionsKNN,"CaseStudy2AttritionPredictionsKNN_RenuKarthikeyan.csv", row.names = T)
After doing my Exploratory Data Analysis, I believe the important predictors for Attrition are: Gender, Department, Job Satisfaction, Distance From Home, Monthly Income, Job Level, Age, Over Time, Percent Salary Hike, Performance Rating, and Education.
set.seed(1234)
iterations = 100
numks = 10
splitPerc = .8
masterAcc = matrix(nrow = iterations, ncol = numks)
for (j in 1:iterations) {
trainIndices <- sample(1:dim(Attritiondata)[1], round(splitPerc * dim(Attritiondata)[1]))
train <- as.data.frame(Attritiondata[trainIndices, ])
test <- as.data.frame(Attritiondata[-trainIndices, ])
response_train <- factor(train$Attrition)
response_test <- factor(test$Attrition)
#UPDATE HERE!!!!
#train_predictors <- train[, c(2,6,7,8,13,16,18,20,24,25,26)]
#test_predictors <- test[, c(2,6,7,8,13,16,18,20,24,25,26) ]
selected_cols <- c(2, 6, 7, 8, 13, 16, 18, 20, 24, 25, 26)
train_predictors <- train[, selected_cols] #subset function
test_predictors <- test[, selected_cols]
train_predictors <- apply(train_predictors, 2, as.numeric)
test_predictors <- apply(test_predictors, 2, as.numeric)
train_predictors <- scale(train_predictors)
test_predictors <- scale(test_predictors)
train_predictors <- as.matrix(train_predictors)
test_predictors <- as.matrix(test_predictors)
# Remove infinite values
train_predictors[!is.finite(train_predictors)] <- 0
test_predictors[!is.finite(test_predictors)] <- 0
for (i in 1:numks) {
classifications <- knn(train_predictors, test_predictors, response_train, prob =TRUE, k = i)
table(classifications, response_test)
CM_Select<- confusionMatrix(table(classifications, response_test), positive = "Yes")
masterAcc[j, i] <- CM_Select$overall[1]
}
}
CM_Select
## Confusion Matrix and Statistics
##
## response_test
## classifications No Yes
## No 144 24
## Yes 4 2
##
## Accuracy : 0.8391
## 95% CI : (0.7759, 0.8903)
## No Information Rate : 0.8506
## P-Value [Acc > NIR] : 0.7085939
##
## Kappa : 0.0731
##
## Mcnemar's Test P-Value : 0.0003298
##
## Sensitivity : 0.07692
## Specificity : 0.97297
## Pos Pred Value : 0.33333
## Neg Pred Value : 0.85714
## Prevalence : 0.14943
## Detection Rate : 0.01149
## Detection Prevalence : 0.03448
## Balanced Accuracy : 0.52495
##
## 'Positive' Class : Yes
##
MeanAcc = colMeans(masterAcc); MeanAcc
## [1] 0.7545977 0.7523563 0.8077586 0.8095977 0.8272414 0.8263793 0.8315517
## [8] 0.8305172 0.8338506 0.8359195
plot(seq(1, numks, 1), MeanAcc, type = "l", ylab = "Mean Accuracy (Positive Class: Yes)")
which.max(MeanAcc)
## [1] 10
max(MeanAcc)
## [1] 0.8359195
We see that the best k is k = 10. (The confusion matrix shown is from the final iteration, not an average across the k values tried.) Here, the accuracy is 83.91%, similar to the initial confusion matrix for all predictors using KNN. The sensitivity is quite low at 7.69%, suggesting the model struggles to correctly identify positive (true attrition) instances. Specificity is high (97.30%), indicating a good ability to correctly identify negative instances (no attrition). The positive predictive value (precision) is 33.33%, meaning about one-third of the instances predicted as positive are true positives.
iterations<- 100
accuracy_table <- numeric(iterations)
trainIndices <- sample(1:dim(Attritiondata)[1], round(splitPerc * dim(Attritiondata)[1]))
train <- as.data.frame(Attritiondata[trainIndices, ])
test <- as.data.frame(Attritiondata[-trainIndices, ])
train_features<- train[, selected_cols]
test_features<- test[, selected_cols]
train_target<- train$Attrition
train_scaled<- scale(train_features)
test_scaled<- scale(test_features) # note: these scaled copies are not used below; knn runs on the unscaled features
for (i in 1:iterations) {
classifications <- knn(train_features, test_features, train_target, prob = TRUE, k = 10)
table(classifications, response_test)
CM_SelectK <- confusionMatrix(table(as.factor(test$Attrition),classifications), positive ="Yes")
accuracy_table[i] <- CM_SelectK$overall[1]
}
#print(accuracy_table)
avg_accuracy<-mean(accuracy_table); avg_accuracy
## [1] 0.8275862
specificity_table<- numeric(iterations)
sensitivity_table<- numeric(iterations)
for (i in 1:iterations) {
classifications <- knn(train_features, test_features, train_target, prob = TRUE, k = 10)
table(classifications, response_test)
CM_SelectK<- confusionMatrix(table(as.factor(test$Attrition),classifications), positive ="Yes")
specificity_table[i] <- CM_SelectK$byClass['Specificity']
sensitivity_table[i] <- CM_SelectK$byClass['Sensitivity']
}
CM_SelectK
## Confusion Matrix and Statistics
##
## classifications
## No Yes
## No 145 0
## Yes 29 0
##
## Accuracy : 0.8333
## 95% CI : (0.7695, 0.8854)
## No Information Rate : 1
## P-Value [Acc > NIR] : 1
##
## Kappa : 0
##
## Mcnemar's Test P-Value : 1.999e-07
##
## Sensitivity : NA
## Specificity : 0.8333
## Pos Pred Value : NA
## Neg Pred Value : NA
## Prevalence : 0.0000
## Detection Rate : 0.0000
## Detection Prevalence : 0.1667
## Balanced Accuracy : NA
##
## 'Positive' Class : Yes
##
avg_specificity<-mean(specificity_table)
avg_specificity
## [1] 0.8323699
avg_sensitivity<-mean(sensitivity_table)
avg_sensitivity
## [1] 0
We see the average specificity is 83.24% and the sensitivity is 0%. This model's performance is notably poor: zero sensitivity means it failed to correctly identify any positive instances. The specificity and negative predictive value are relatively high, but the complete lack of sensitivity is a serious limitation. This suggests the model needs further refinement, or a different approach, to address the class imbalance and improve its ability to classify positive instances.
set.seed(1234)
iterations = 100
masterAcc = matrix(nrow = iterations)
splitPerc = .8 #Training / Test split Percentage
for(j in 1:iterations)
{
trainIndices <- sample(1:dim(Attritiondata)[1], round(splitPerc * dim(Attritiondata)[1]))
train <- as.data.frame(Attritiondata[trainIndices, ])
test <- as.data.frame(Attritiondata[-trainIndices, ])
train$Attrition <- factor(train$Attrition, levels = c("Yes", "No"))
test$Attrition <- factor(test$Attrition, levels = c("Yes", "No"))
model <- naiveBayes(train[, -3], as.factor(train$Attrition), laplace = 1)
predictions <- predict(model, test[, -3])
confMatrix <- table(predictions, as.factor(test$Attrition))
CM_NB_All <- confusionMatrix(confMatrix)
masterAcc[j] <- CM_NB_All$overall[1]
}
CM_NB_All
## Confusion Matrix and Statistics
##
##
## predictions Yes No
## Yes 23 92
## No 2 57
##
## Accuracy : 0.4598
## 95% CI : (0.3841, 0.5368)
## No Information Rate : 0.8563
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.1211
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.9200
## Specificity : 0.3826
## Pos Pred Value : 0.2000
## Neg Pred Value : 0.9661
## Prevalence : 0.1437
## Detection Rate : 0.1322
## Detection Prevalence : 0.6609
## Balanced Accuracy : 0.6513
##
## 'Positive' Class : Yes
##
MeanAcc = colMeans(masterAcc); MeanAcc
## [1] 0.6482184
Given all the predictors, the model's sensitivity is the percentage of actual attrition cases correctly identified — its ability to capture employees who are truly at risk of attrition. This model shows imbalanced performance, with high sensitivity (92.00%) but low specificity (38.26%). It performs well in identifying actual positive instances (attrition) but struggles to correctly identify negative instances (no attrition). The positive predictive value is low: when the model predicts attrition, there is only a 20% chance the prediction is correct.
set.seed(1234)
iterations = 100
masterAcc = matrix(nrow = iterations)
splitPerc = .8 #Training / Test split Percentage
for(j in 1:iterations)
{
trainIndices <- sample(1:dim(Attritiondata)[1], round(splitPerc * dim(Attritiondata)[1]))
train <- as.data.frame(Attritiondata[trainIndices, ])
test <- as.data.frame(Attritiondata[-trainIndices, ])
train$Attrition <- factor(train$Attrition, levels = c("Yes", "No"))
test$Attrition <- factor(test$Attrition, levels = c("Yes", "No"))
model2 <- naiveBayes(train[, c(2, 6, 7, 8, 13, 16, 18, 20, 24, 25, 26)], as.factor(train$Attrition), laplace = 1)
predictions <- predict(model2, test[, c(2, 6, 7, 8, 13, 16, 18, 20, 24, 25, 26)])
confMatrix <- table(predictions, as.factor(test$Attrition))
CM_NB_Select <- confusionMatrix(confMatrix)
masterAcc[j] <- CM_NB_Select$overall[1]
}
CM_NB_Select
## Confusion Matrix and Statistics
##
##
## predictions Yes No
## Yes 3 4
## No 22 145
##
## Accuracy : 0.8506
## 95% CI : (0.7888, 0.9)
## No Information Rate : 0.8563
## P-Value [Acc > NIR] : 0.6357048
##
## Kappa : 0.133
##
## Mcnemar's Test P-Value : 0.0008561
##
## Sensitivity : 0.12000
## Specificity : 0.97315
## Pos Pred Value : 0.42857
## Neg Pred Value : 0.86826
## Prevalence : 0.14368
## Detection Rate : 0.01724
## Detection Prevalence : 0.04023
## Balanced Accuracy : 0.54658
##
## 'Positive' Class : Yes
##
MeanAcc = colMeans(masterAcc)
MeanAcc
## [1] 0.8312069
#use predict function on the "validation" sets. Use the same model. Test will be validation set.
Predictions<- predict(model2,AttritionTest[,c(2, 5, 6, 7, 12, 15, 17, 19, 23, 24, 25)])
AttritionTest$Attrition<- Predictions
#View(AttritionTest$Attrition)
AttritionPredictionsNB<- AttritionTest %>% select(c("ID","Attrition"))
write.csv(AttritionPredictionsNB,"CaseStudy2AttritionPredictionsNB_RenuKarthikeyan.csv", row.names = T)
Sensitivity is 12%, meaning only 12% of actual attrition cases are correctly identified; the model is not very effective at capturing employees who are truly at risk of attrition. This model has a higher accuracy of 85.06%, with imbalanced performance: low sensitivity and high specificity. However, accuracy can be misleading in imbalanced datasets where one class ("No attrition") dominates, so it is not the best metric here. The positive predictive value (PPV) is 42.86%: when the model predicts attrition, there is a 42.86% chance the prediction is correct, reflecting the model's precision in identifying true positives. The prevalence of attrition — the proportion of actual positive cases in the dataset — is low, at 14.37%, which contributes to the imbalanced nature of the performance metrics.
Among the KNN models, the k = 7 model with all predictors and no threshold adjustment has the best accuracy. Among the Naive Bayes models, the one with select predictors has the highest accuracy, outperforming the model with all predictors.
Naive Bayes has a higher accuracy (85.06%) than KNN (79.89%), although accuracy alone may not be the most informative metric in imbalanced datasets. KNN has higher sensitivity (40.00%) than Naive Bayes (12.00%); sensitivity is crucial for identifying attrition, as it is the proportion of actual positive cases that are captured. Naive Bayes has higher specificity (97.32%) than KNN (81.07%); specificity matters for minimizing false positives, but must be balanced against sensitivity. Naive Bayes also has a higher positive predictive value (precision) at 42.86%, versus 5.88% for KNN; precision indicates the accuracy of positive predictions.
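For reference, these figures (transcribed from the confusion matrices reported above) can be collected into one comparison table:

```r
# Sketch: side-by-side comparison of the two best models, with metrics
# transcribed from the confusion matrices above (not recomputed here).
comparison <- data.frame(
  Model       = c("KNN (k = 7, all predictors)", "Naive Bayes (select predictors)"),
  Accuracy    = c(0.7989, 0.8506),
  Sensitivity = c(0.4000, 0.1200),
  Specificity = c(0.8107, 0.9732),
  Precision   = c(0.0588, 0.4286)
)
comparison
```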
Of the two best models (the best KNN and the best Naive Bayes), I believe Naive Bayes is the better model to predict Attrition, given its higher positive predictive value (precision), higher accuracy, and narrower confidence interval.
This concludes the presentation and analysis. Thank you for your time; I look forward to empowering Frito Lay with data-driven insights. I also created a Shiny app to visualize and explore the attrition data — please feel free to look into the link provided. If you have any questions, please reach out to me; my email is in the attached PowerPoint presentation. Thank you!